Exploration of Different Imputation Methods for Missing Data
Karthik Aerra, Liz Miller, Mohit Kumar Veeraboina, Robert Stairs
2024-07-27
Introduction
Placeholder Title 1
Bullet 1
Bullet 2
Bullet 3
Placeholder Title 2
Bullet 1
Bullet 2
Bullet 3
Placeholder Title 3
Bullet 1
Bullet 2
Bullet 3
Placeholder Title 4
Bullet 1
Bullet 2
Bullet 3
Placeholder Title 5
Bullet 1
Bullet 2
Bullet 3
Methods
Placeholder Title 6
Bullet 1
Bullet 2
Bullet 3
Placeholder Title 7
Bullet 1
Bullet 2
Bullet 3
Placeholder Title 8
Bullet 1
Bullet 2
Bullet 3
Placeholder Title 9
Bullet 1
Bullet 2
Bullet 3
Analysis and Results
Introduction to the Dataset
“Ozone: Los Angeles Ozone Pollution Data, 1976” from mlbench package in R (“Ozone”)
It contains observations related to pollution levels in the Los Angeles area during 1976.
It contains a total of 13 variables and 366 observations (one for each day of the year)
We chose this dataset due to the high volume of already-missing data
It demonstrates a real-life scenario in which the missing values are truly unknown, so the effectiveness of imputation methods cannot be measured directly.
It is up to the investigator to choose a methodology for handling the missing data and appropriate metrics for evaluating effectiveness of missing data methods
Variables in the Ozone Dataset
V1: Month, 1-12, where 1 is January and 12 is December
V2: Day of month
V3: Day of week, 1-7, where 1 is Monday and 7 is Sunday
V4: Daily maximum one-hour-average ozone reading
V5: 500 millibar pressure height (m) measured at Vandenberg AFB
V6: Wind speed (mph) at Los Angeles International Airport (LAX)
V7: Humidity (%) at LAX
V8: Temperature (degrees F) measured at Sandburg, CA
V9: Temperature (degrees F) measured at El Monte, CA
V10: Inversion base height (feet) at LAX
V11: Pressure gradient (mm Hg) from LAX to Daggett, CA
V12: Inversion base temperature (degrees F) at LAX
V13: Visibility (miles) measured at LAX
Summarizing the Data
The summary() function shows the number of NAs for each column
Most columns have <20 NAs; however, V9 contains 139 missing values
V9: Temperature (degrees F) measured at El Monte, CA
V4 V1 V2 V3 V5
Min. : 1.00 1 : 31 1 : 12 1:52 Min. :5320
1st Qu.: 5.00 3 : 31 2 : 12 2:52 1st Qu.:5700
Median : 9.00 5 : 31 3 : 12 3:52 Median :5770
Mean :11.53 7 : 31 4 : 12 4:53 Mean :5753
3rd Qu.:16.00 8 : 31 5 : 12 5:53 3rd Qu.:5830
Max. :38.00 10 : 31 6 : 12 6:52 Max. :5950
NA's :5 (Other):180 (Other):294 7:52 NA's :12
V6 V7 V8 V9
Min. : 0.000 Min. :19.00 Min. :25.00 Min. :27.68
1st Qu.: 3.000 1st Qu.:49.00 1st Qu.:51.00 1st Qu.:49.73
Median : 5.000 Median :65.00 Median :62.00 Median :57.02
Mean : 4.869 Mean :58.48 Mean :61.91 Mean :56.85
3rd Qu.: 6.000 3rd Qu.:73.00 3rd Qu.:72.00 3rd Qu.:66.11
Max. :11.000 Max. :93.00 Max. :93.00 Max. :82.58
NA's :15 NA's :2 NA's :139
V10 V11 V12 V13
Min. : 111 Min. :-69.0 Min. :27.50 Min. : 0.0
1st Qu.: 890 1st Qu.:-10.0 1st Qu.:51.26 1st Qu.: 70.0
Median :2125 Median : 24.0 Median :62.24 Median :110.0
Mean :2591 Mean : 17.8 Mean :60.93 Mean :123.3
3rd Qu.:5000 3rd Qu.: 45.0 3rd Qu.:70.52 3rd Qu.:150.0
Max. :5000 Max. :107.0 Max. :91.76 Max. :500.0
NA's :15 NA's :1 NA's :14
Data Pre-Processing Summary
All columns converted to numeric data type
Renamed columns for clearer interpretation during analysis and reporting
Combined month, day of month and day of week into one data column
Data Exploration Summary
Summary statistics by day of week, month
Histograms to visualize distributions of data
Correlation coefficients for all features with respect to Ozone levels (output)
Strong Positive Correlations: Humidity_LAX, Pressure_afb, IBT_LAX, Temp_EM, and Temp_sandburg
Strong Negative Correlations: IBH_LAX and Visibility_LAX
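The correlation step above can be sketched in a couple of lines, assuming the renamed, all-numeric data frame is called `ozone2` and the target column `Ozone_reading`, as in the later code chunks:

```r
# Correlation of every feature with the ozone reading (assumed column names).
# use = "pairwise.complete.obs" uses all non-missing pairs,
# so the NAs do not force whole rows to be dropped.
cors <- cor(ozone2, use = "pairwise.complete.obs")[, "Ozone_reading"]
sort(cors, decreasing = TRUE)
```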
Exploration of the Missing Data
The total number of missing data points for each column is shown below as a count and as a percentage.
Most of the columns contain <5% missing values. Temp_EM contains 139 missing values, which is 38.0% of the observations.
The total number of missing values in the dataset is 203 out of 4,758 (13 columns times 366 observations). This represents 4.3% of the entire dataset.
There are a few instances where more than one column has missing data, but the majority of rows with missing values are only missing Temp_EM
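A minimal base-R sketch of how the per-column counts and percentages above can be produced; a small toy frame stands in for the real data so the snippet runs on its own:

```r
# Toy data frame standing in for ozone2 (hypothetical values)
toy <- data.frame(a = c(1, NA, 3, NA), b = c(NA, 2, 3, 4))
# Count of NAs per column, and as a percentage of all rows
na_counts  <- colSums(is.na(toy))
na_percent <- round(100 * na_counts / nrow(toy), 1)
data.frame(missing = na_counts, percent = na_percent)
```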
Statistical Testing for MCAR (Missing Completely at Random)
Randomness Testing of Missing Data
Little's MCAR test was performed to analyze randomness.
Null Hypothesis (H0): The data are missing completely at random (MCAR).
Alternative Hypothesis (H1): The data are not missing completely at random.
Based on the p-value from the test (4.102052e-12), the null hypothesis is rejected: the data are not missing completely at random (MCAR).
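One common implementation of Little's test is `mcar_test()` from the naniar package (the same package that provides `gg_miss_fct()`, used later in this report); a sketch, assuming the pre-processed frame `ozone2`:

```r
library(naniar)
# Little's MCAR test: returns the chi-square statistic,
# degrees of freedom, p-value, and number of missingness patterns
mcar_test(ozone2)
```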
Additionally, the Hawkins and non-parametric tests were performed to analyze the randomness of the missing data.
Test Data for Normality
In order to interpret these tests, the data must be checked for normality. The Shapiro-Wilk and Anderson-Darling tests were performed.
Shapiro-Wilk normality test
data: ozone_levels
W = 0.91044, p-value = 8.37e-14
Anderson-Darling normality test
data: ozone_levels
A = 10.027, p-value < 2.2e-16
A histogram and QQ plot were also created
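The normality checks and plots described above can be sketched as follows; the column name `Ozone_reading` is an assumption (the `ozone_levels` vector name matches the test output shown):

```r
library(nortest)  # provides ad.test() for the Anderson-Darling test
ozone_levels <- na.omit(ozone2$Ozone_reading)  # assumed column name
shapiro.test(ozone_levels)  # Shapiro-Wilk normality test
ad.test(ozone_levels)       # Anderson-Darling normality test
# Visual checks of the distribution
hist(ozone_levels, main = "Distribution of Ozone Readings")
qqnorm(ozone_levels); qqline(ozone_levels)
```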
Run Hawkins and Non-Parametric Tests
Call:
MissMech::TestMCARNormality(data = .)
Number of Patterns: 4
Total number of cases used in the analysis: 354
Pattern(s) used:
Temp_EM Month Day_of_month Day_of_week Pressure_afb
group.1 NA 1 1 1 1
group.2 1 1 1 1 1
group.3 1 1 1 1 1
group.4 1 1 1 1 NA
Wind_speed_LAX Humidity_LAX Temp_sandburg IBH_LAX
group.1 1 1 1 1
group.2 1 1 1 1
group.3 1 NA 1 NA
group.4 1 1 1 1
Pressure_gradient IBT_LAX Visibility_LAX Number of cases
group.1 1 1 1 129
group.2 1 1 1 206
group.3 1 NA 1 9
group.4 1 1 1 10
Test of normality and Homoscedasticity:
-------------------------------------------
Hawkins Test:
P-value for the Hawkins test of normality and homoscedasticity: 0.0113103
Either the test of multivariate normality or homoscedasticity (or both) is rejected.
Provided that normality can be assumed, the hypothesis of MCAR is
rejected at 0.05 significance level.
Non-Parametric Test:
P-value for the non-parametric test of homoscedasticity: 0.2743229
Reject Normality at 0.05 significance level.
There is not sufficient evidence to reject MCAR at 0.05 significance level.
Hawkins Test
P-value for the Hawkins test of normality and homoscedasticity: 0.0113103
This test checks for both multivariate normality and homoscedasticity (equal variances).
Interpretation: Since the p-value is less than 0.05, the null hypothesis of multivariate normality or homoscedasticity (or both) is rejected. This means that the data does not follow a multivariate normal distribution, or the variances are not equal across groups.
MCAR Hypothesis: If normality could be assumed, this rejection would imply rejecting MCAR. However, since the normality tests above show that the data are not normally distributed, the Hawkins result cannot be attributed to a violation of MCAR, and the non-parametric test is the more appropriate check here.
Non-Parametric Test
P-value for the non-parametric test of homoscedasticity: 0.2743229
This test checks for homoscedasticity without assuming normality.
Interpretation: Since the p-value is greater than 0.05, there is not enough evidence to reject the null hypothesis of homoscedasticity. This suggests that the variances are equal across groups.
MCAR Hypothesis: There is not sufficient evidence to reject the hypothesis of MCAR at the 0.05 significance level.
Further Visualization of the Missing Data: Missingness Patterns
The data were further visualized using the gg_miss_fct() function to plot the proportion of missing values for each column (y-axis) broken down by one variable at a time (x-axis). The purpose of this is to determine whether there are discernible patterns in the missingness of the data. This provides additional supporting information for classifying the data as MCAR, MAR, or MNAR.
Focusing on Temp_EM since it is the column with the highest number of missing values, the data appear to be missing randomly with respect to most of the variables. Among the date columns, the pattern of missingness appears random with respect to day of the month. However, there is a high proportion of missing data on days 6 and 7 of the week (Saturdays and Sundays), which indicates that Temp_EM was likely not measured on weekends throughout the year. The values for Temp_EM also appear to be missing randomly with respect to month, though there is a notably high frequency of missing data around the month of May.
For the other variables in the dataset, there do not appear to be any obvious patterns to the missing data. Most importantly, the frequency of missing data for Temp_EM does not appear to depend on the output variable (ozone reading). Therefore, we can most likely proceed to imputation with the assumption that our data are missing completely at random (MCAR).
# Create gg_miss_fct plots with adjusted themes
p1 <- gg_miss_fct(ozone2, fct = Month) + ggtitle("Missing Data by Month") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + theme(plot.title = element_text(size = 8))
p2 <- gg_miss_fct(ozone2, fct = Day_of_month) + ggtitle("Missing Data by Day of Month") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + theme(plot.title = element_text(size = 8))
p3 <- gg_miss_fct(ozone2, fct = Day_of_week) + ggtitle("Missing Data by Day of Week") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + theme(plot.title = element_text(size = 8))
p4 <- gg_miss_fct(ozone2, fct = Ozone_reading) + ggtitle("Missing Data by Ozone Reading") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + theme(plot.title = element_text(size = 8))
p5 <- gg_miss_fct(ozone2, fct = Pressure_afb) + ggtitle("Missing Data by Pressure (AFB)") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + theme(plot.title = element_text(size = 8))
p6 <- gg_miss_fct(ozone2, fct = Wind_speed_LAX) + ggtitle("Missing Data by Wind Speed") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + theme(plot.title = element_text(size = 8))
p7 <- gg_miss_fct(ozone2, fct = Humidity_LAX) + ggtitle("Missing Data by Humidity (LAX)") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + theme(plot.title = element_text(size = 8))
p8 <- gg_miss_fct(ozone2, fct = Temp_sandburg) + ggtitle("Missing Data by Temperature (Sandburg)") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + theme(plot.title = element_text(size = 8))
p9 <- gg_miss_fct(ozone2, fct = Temp_EM) + ggtitle("Missing Data by Temperature (EM)") + theme(plot.title = element_text(size = 8))
p10 <- gg_miss_fct(ozone2, fct = IBH_LAX) + ggtitle("Missing Data by IBH_LAX") + theme(plot.title = element_text(size = 8))
p11 <- gg_miss_fct(ozone2, fct = Pressure_gradient) + ggtitle("Missing Data by Pressure Gradient") + theme(plot.title = element_text(size = 8))
p12 <- gg_miss_fct(ozone2, fct = IBT_LAX) + ggtitle("Missing Data by IBT_LAX") + theme(plot.title = element_text(size = 8))
p13 <- gg_miss_fct(ozone2, fct = Visibility_LAX) + ggtitle("Missing Data by Visibility_LAX") + theme(plot.title = element_text(size = 8))
# Arrange the plots into grids with proper spacing
grid1 <- grid.arrange(p1, p2, p3, p4, nrow = 2)
grid2 <- grid.arrange(p5, p6, p7, p8, nrow = 2)
grid3 <- grid.arrange(p9, p10, p11, nrow = 2)
grid4 <- grid.arrange(p12, p13, nrow = 1)
Missing Data Imputation Using Simple Methods
Missing data were handled by listwise deletion or feature selection (column removal) as baseline comparisons for the imputation methods. Additionally, missing data were imputed using the mean, median, or mode as a demonstration of simple imputation techniques. Listwise deletion simply deletes every row that contains a missing value. Feature removal deletes a column if its percentage of missing values exceeds a set threshold (we chose 20%). With mean, median, or mode imputation, all NA values in a column are replaced by a single value: the mean, median, or mode of that column. One dataframe was created for each method, resulting in a total of five dataframes.
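The listwise-deletion dataframe is not created in a chunk shown here; a one-line sketch, using the `dropNA_data` name that the model-fitting code later refers to:

```r
# Listwise deletion: drop every row containing at least one NA
# (assumes the pre-processed data frame ozone2)
dropNA_data <- na.omit(ozone2)
nrow(dropNA_data)  # rows remaining after deletion
```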
Feature selection (column deletion)
In this case, the Temp_EM column is dropped because more than 20% of its values are missing. Remaining rows with NAs are also dropped so that machine learning models can be applied.
# Function to drop a column if its quantity of missing values is over the threshold
drop_na_columns <- function(data, threshold) {
  na_counts <- colSums(is.na(data))
  na_proportion <- na_counts / nrow(data)
  data <- data[, na_proportion <= threshold]
  return(data)
}
# Define threshold (e.g., 20% NA allowed)
threshold <- 0.20
# Drop columns based on the NA threshold
dropCol_data <- drop_na_columns(ozone2, threshold)  # the column Temp_EM gets dropped
dropCol_data <- data.frame(dropCol_data)
dropCol_data <- na.omit(dropCol_data)
head(dropCol_data)
Create a new dataset where all missing values are replaced with the mean of the column.
# Function for mean imputation
mean_impute <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  return(x)
}
imputed_data_mean <- apply(ozone2, 2, mean_impute)
imputed_data_mean <- data.frame(imputed_data_mean)
head(imputed_data_mean)
Create a new dataset where all missing values are replaced with the median of the column.
# Function for median imputation
median_impute <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  return(x)
}
imputed_data_median <- apply(ozone2, 2, median_impute)
imputed_data_median <- data.frame(imputed_data_median)
head(imputed_data_median)
Create a new dataset where all missing values are replaced with the mode of the column.
# Function for mode imputation (using the most common value)
mode_impute <- function(x) {
  mode_val <- as.numeric(names(sort(table(x), decreasing = TRUE)[1]))
  x[is.na(x)] <- mode_val
  return(x)
}
imputed_data_mode <- apply(ozone2, 2, mode_impute)
imputed_data_mode <- data.frame(imputed_data_mode)
head(imputed_data_mode)
Missing Data Imputation Using MICE (Multiple Imputation Method) and missForest (Machine Learning Method)
Next, the missing data were imputed using the MICE and missForest methods.
missForest
library(missForest)
# verbose = TRUE supplies additional output between iterations
mf_result <- missForest(ozone2, verbose = TRUE)
# The estimated (out-of-bag) imputation error:
mf_result$OOBerror
# A true imputation error would require xtrue, a complete data matrix;
# it is not available here because the missing values are truly unknown.
# Convert the imputed data back to a data frame
ozone2_mf <- as.data.frame(mf_result$ximp)
print(sum(is.na(ozone2_mf)))
MICE
library(mice)
# Impute missing values using MICE
# pmm = predictive mean matching (suitable for continuous variables like temperature, wind, etc.)
# m = 5: number of imputed datasets to create
# maxit = 50: maximum number of iterations
ozone2_mice <- mice(ozone2, method = "pmm", m = 5, maxit = 50)
# Extract a completed dataset from the mice object (the first by default)
ozone2_mice <- complete(ozone2_mice)
# Convert completed data to a data frame
ozone2_mice <- as.data.frame(ozone2_mice)
print(sum(is.na(ozone2_mice)))
Model-Fitting of Imputed Datasets
Make Predictions Using Data Where Missing Values Were Deleted Using Listwise Deletion (Deletion of All Rows with Missing Data)
The listwise deleted dataset was split into testing and training datasets using a split ratio of 0.75. Models were then fit using the following algorithms:
Random Forest
K-Nearest Neighbors (KNN)
Decision Tree
After models were fit to the training data, predictions were made on the unseen test data. RMSE was then reported for each model.
Load additional libraries and split the data into training and testing sets
# Load additional libraries for model fitting
library(caret)          # for fitting KNN models
library(e1071)
library(rsample)        # for creating validation splits
library(recipes)        # for feature engineering
library(randomForest)
library(rpart)          # decision tree
library(tidymodels)
library(class)
library(vip)

# Split the data into training and testing sets
set.seed(123)
data_split_dropNA <- initial_split(dropNA_data, prop = 0.75)
train_dropNA <- training(data_split_dropNA)
test_dropNA <- testing(data_split_dropNA)

# Split data into predictors and target
X <- train_dropNA[, -1]            # features
y <- train_dropNA$Ozone_reading    # target

# Function to calculate RMSE
rmse <- function(pred, actual) {
  sqrt(mean((pred - actual)^2))
}
Random Forest
# Random forest model
rf_dropNA <- randomForest(Ozone_reading ~ ., data = train_dropNA)
# Make predictions
rf_dropNA_pred <- predict(rf_dropNA, newdata = test_dropNA)
# Plot variable importance
varImpPlot(rf_dropNA, main = "Variable Importance Plot for Random Forest Model")
Decision Tree
# Decision tree model
tree_dropNA <- rpart(Ozone_reading ~ ., data = train_dropNA)
# Make predictions
tree_dropNA_pred <- predict(tree_dropNA, newdata = test_dropNA)
# Plot variable importance
var_importance <- varImp(tree_dropNA)
barplot(var_importance$Overall, main = "Variable Importance for Decision Tree", xlab = "Variable", ylab = "Importance")
tree_dropNA_rmse <- rmse(tree_dropNA_pred, test_dropNA$Ozone_reading)
cat("Decision Tree RMSE:", tree_dropNA_rmse, "\n")
Decision Tree RMSE: 5.799608
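The KNN fit for the listwise-deleted data is not shown above; a sketch using caret's `train()` (caret is loaded in the setup chunk), with hypothetical object names following the pattern of the other chunks:

```r
# KNN regression; caret tunes k over a small default grid.
# Centering and scaling matter for KNN because the features
# are on very different scales (mph, %, feet, degrees F).
knn_dropNA <- train(Ozone_reading ~ ., data = train_dropNA,
                    method = "knn", preProcess = c("center", "scale"))
knn_dropNA_pred <- predict(knn_dropNA, newdata = test_dropNA)
cat("KNN RMSE:", rmse(knn_dropNA_pred, test_dropNA$Ozone_reading), "\n")
```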
Make Predictions Using Data Where Columns with >20% Missing Data Were Deleted (Temp_EM column)
The deleted column dataset was split into testing and training datasets using a split ratio of 0.75. Models were then fit using the following algorithms:
Random Forest
K-Nearest Neighbors (KNN)
Decision Tree
After models were fit to the training data, predictions were made on the unseen test data. RMSE was then reported for each model.
For the column deletion method (feature selection), random forest had the best model fit, with an RMSE of 3.66.
Random Forest
Random Forest RMSE: 3.659657
KNN
KNN RMSE: 5.548309
Decision Tree
Decision Tree RMSE: 4.266184
Make Predictions Using Data Where Missing Values were Imputed using the Mean
The mean imputed dataset was split into testing and training datasets using a split ratio of 0.75. Models were then fit using the following algorithms:
Random Forest
K-Nearest Neighbors (KNN)
Decision Tree
After models were fit to the training data, predictions were made on the unseen test data. RMSE was then reported for each model.
For the mean imputation method, random forest had the best model fit, with an RMSE of 4.90.
Random Forest RMSE: 4.901004
KNN
KNN RMSE: 6.117349
Decision Tree
Decision Tree RMSE: 5.213247
Make Predictions Using Data Where Missing Values were Imputed using the Median
The median imputed dataset was split into testing and training datasets using a split ratio of 0.75. Models were then fit using the following algorithms:
Random Forest
K-Nearest Neighbors (KNN)
Decision Tree
After models were fit to the training data, predictions were made on the unseen test data. RMSE was then reported for each model.
For the median imputation method, random forest had the best model fit, with an RMSE of 5.07.
Random Forest
Random Forest RMSE: 5.073677
KNN
KNN RMSE: 6.298989
Decision Tree
Decision Tree RMSE: 5.249752
Make Predictions Using Data Where Missing Values were Imputed using the Mode
The mode imputed dataset was split into testing and training datasets using a split ratio of 0.75. Models were then fit using the following algorithms:
Random Forest
K-Nearest Neighbors (KNN)
Decision Tree
After models were fit to the training data, predictions were made on the unseen test data. RMSE was then reported for each model.
For the mode imputation method, random forest had the best model fit, with an RMSE of 5.40.
Random Forest
Random Forest RMSE: 5.398429
KNN
KNN RMSE: 6.371706
Decision Tree
Decision Tree RMSE: 6.603588
Make Predictions using data where missing values were imputed with complex methods:
Make Predictions Using Data Where Missing Values were Imputed using missForest Method
The missForest imputed dataset was split into testing and training datasets using a split ratio of 0.75. Models were then fit using the following algorithms:
Random Forest
K-Nearest Neighbors (KNN)
Decision Tree
After models were fit to the training data, predictions were made on the unseen test data. RMSE was then reported for each model.
For the missForest imputation method, random forest had the best model fit, with an RMSE of 4.49.
Random Forest
Random Forest RMSE: 4.493497
KNN
KNN RMSE: 6.118703
Decision Tree
Decision Tree RMSE: 5.537124
Make Predictions Using Data Where Missing Values were Imputed using MICE Method
The MICE imputed dataset was split into testing and training datasets using a split ratio of 0.75. Models were then fit using the following algorithms:
Random Forest
K-Nearest Neighbors (KNN)
Decision Tree
After models were fit to the training data, predictions were made on the unseen test data. RMSE was then reported for each model.
For the MICE imputation method, random forest had the best model fit, with an RMSE of 4.37.
Random Forest
Random Forest RMSE: 4.367134
KNN
KNN RMSE: 6.142781
Decision Tree
Decision Tree RMSE: 5.077593
Summarize Model Performance for Each Imputation Methodology
The RMSE for each model and imputation method combination was summarized into a data frame. The results are shown below. The best model fit (lowest RMSE) was obtained by using feature selection (column deletion) to handle the missing data and the random forest algorithm to train on the resulting dataset. MICE imputation with random forest was the second-best combination, followed by missForest imputation with random forest.
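The summary frame can be assembled directly from the RMSE values reported in the preceding sections (transcribed here, rounded to two decimals; listwise deletion is omitted because only its decision-tree RMSE is reported):

```r
# RMSEs transcribed from the preceding sections
rmse_summary <- data.frame(
  method = c("Column deletion", "Mean", "Median", "Mode", "missForest", "MICE"),
  rf     = c(3.66, 4.90, 5.07, 5.40, 4.49, 4.37),
  knn    = c(5.55, 6.12, 6.30, 6.37, 6.12, 6.14),
  tree   = c(4.27, 5.21, 5.25, 6.60, 5.54, 5.08)
)
rmse_summary[order(rmse_summary$rf), ]  # best random forest fit first
```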
In summary, the management of missing data is essential to maintaining the integrity and dependability of data analysis. This work demonstrates a variety of methodologies for handling missing data: deletion methods (listwise deletion and feature selection), single imputation techniques (mean, median, and mode imputation), multiple imputation using MICE, and machine learning-based algorithms such as missForest. The "Ozone" dataset from the mlbench package in R was used to demonstrate these techniques, along with statistical tests and visualization methods that help classify data as MCAR, MAR, or MNAR.

With respect to the "Ozone" dataset, deletion of the Temp_EM column, which contains ~38% missing data, was found to be the most effective method for handling missing data. This was demonstrated through machine learning models trained on datasets where the missing data were handled using the various techniques described above. A total of seven datasets were created, and each was trained and tested with KNN, decision tree, and random forest algorithms. The random forest model in combination with feature selection resulted in the lowest RMSE on the test (unseen) dataset. Column deletion may have performed best in this instance because the data appear to be MCAR based on statistical testing.

It is best practice to test a variety of methods, as was done in this work, and to predefine success criteria when handling missing data. To obtain reliable and accurate results, the imputation method selected should ultimately align with the unique features of the dataset and the analytical objectives.